# check if you have ISLR package, if not, install it
library(ISLR)
library(data.table)
library(skimr)
library(ggrepel)
library(patchwork)
library(latex2exp)
library(factoextra)
library(ggbiplot)
library(gt)
library(here)
library(tidyverse)
rm(list=ls())
i_am('hw2_sp2022.Rmd')

Overview

Principal Component Analysis (PCA) is widely used for data exploration, dimension reduction, and data visualization. The aim is to transform the original variables into uncorrelated linear combinations that retain as much of the information in the data as possible. High-dimensional data often reveals clusters when viewed in a lower-dimensional projection.

Clustering analysis is another form of EDA. Here we hope to group data points so that points within a group are close to each other while points in different groups are far apart. Clustering on the PCs can be effective. Clustering analysis can be quite subjective, since we need to summarize the properties within each group ourselves.

Both PCA and clustering analysis are so-called unsupervised learning: no response variable is involved in the process.

For supervised learning, we try to find out how a set of predictors relates to some response variable of interest. Multiple regression is still, by far, one of the most popular methods. We use a linear model as a working model for its simplicity and interpretability. It is important to use domain knowledge as much as we can to determine both the form of the response and the functional form of the predictors.

0.1 Objectives

  • PCA
  • SVD
  • Clustering Analysis
  • Linear Regression

0.2 Review materials

  • Study Module 2: PCA
  • Study Module 3: Clustering Analysis
  • Study Module 4: Multiple regression

0.3 Data needed

  • NLSY79.csv
  • brca_subtype.csv
  • brca_x_patient.csv

1 Case study 1: Self-esteem

Self-esteem generally describes a person’s overall sense of self-worth and personal value. It can play a significant role in one’s motivation and success throughout life. Factors that influence self-esteem include inner thinking, health conditions, age, and life experiences. We will try to identify possible factors in our data that are related to the level of self-esteem.

The well-cited National Longitudinal Survey of Youth (NLSY79) follows about 13,000 individuals, and a wealth of individual-year information has been gathered through surveys. The survey data is open to the public here. Among the many variables, we assembled a subset including personal demographic variables in different years, household environment in 1979, ASVAB test scores in 1981, and self-esteem scores in 1981 and 1987 respectively.

The data is stored in NLSY79.csv.

Here are the descriptions of the variables:

Personal Demographic Variables

  • Gender: a factor with levels “female” and “male”
  • Education05: years of education completed by 2005
  • HeightFeet05, HeightInch05: height measurement. For example, a person of 5’10 will be recorded as HeightFeet05=5, HeightInch05=10.
  • Weight05: weight in lbs.
  • Income87, Income05: total annual income from wages and salary in 1987 and 2005 respectively.
  • Job87 (missing), Job05: job type in 1987 and 2005, including Protective Service Occupations, Food Preparation and Serving Related Occupations, Cleaning and Building Service Occupations, Entertainment Attendants and Related Workers, Funeral Related Occupations, Personal Care and Service Workers, Sales and Related Workers, Office and Administrative Support Workers, Farming, Fishing and Forestry Occupations, Construction Trade and Extraction Workers, Installation, Maintenance and Repairs Workers, Production and Operating Workers, Food Preparation Occupations, Setters, Operators and Tenders, Transportation and Material Moving Workers

Household Environment

  • Imagazine: a variable taking on the value 1 if anyone in the respondent’s household regularly read magazines in 1979, otherwise 0
  • Inewspaper: a variable taking on the value 1 if anyone in the respondent’s household regularly read newspapers in 1979, otherwise 0
  • Ilibrary: a variable taking on the value 1 if anyone in the respondent’s household had a library card in 1979, otherwise 0
  • MotherEd: mother’s years of education
  • FatherEd: father’s years of education
  • FamilyIncome78

Variables Related to ASVAB test Scores in 1981

Test Description
AFQT percentile score on the AFQT intelligence test in 1981
Coding score on the Coding Speed test in 1981
Auto score on the Automotive and Shop test in 1981
Mechanic score on the Mechanic test in 1981
Elec score on the Electronics Information test in 1981
Science score on the General Science test in 1981
Math score on the Math test in 1981
Arith score on the Arithmetic Reasoning test in 1981
Word score on the Word Knowledge Test in 1981
Parag score on the Paragraph Comprehension test in 1981
Numer score on the Numerical Operations test in 1981

Self-Esteem test 81 and 87

We have two sets of self-esteem tests, one in 1981 and the other in 1987. Each set has the same 10 questions. They are labeled Esteem81 and Esteem87 respectively, followed by the question number. For example, Esteem81_1 is Esteem question 1 in 1981.

Each of the following 10 questions is answered on a four-point scale: 1: strongly agree, 2: agree, 3: disagree, 4: strongly disagree.

  • Esteem 1: “I am a person of worth”
  • Esteem 2: “I have a number of good qualities”
  • Esteem 3: “I am inclined to feel like a failure”
  • Esteem 4: “I do things as well as others”
  • Esteem 5: “I do not have much to be proud of”
  • Esteem 6: “I take a positive attitude towards myself and others”
  • Esteem 7: “I am satisfied with myself”
  • Esteem 8: “I wish I could have more respect for myself”
  • Esteem 9: “I feel useless at times”
  • Esteem 10: “I think I am no good at all”

1.1 Data preparation

Load the data. Do a quick EDA to get familiar with the data set. Pay attention to the unit of each variable. Are there any missing values?

esteem <- read_csv(here('data','NLSY79.csv'))
skim(esteem)
Data summary
Name esteem
Number of rows 2431
Number of columns 46
_______________________
Column type frequency:
character 2
numeric 44
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Gender 0 1.00 4 6 0 2 0
Job05 56 0.98 16 83 0 30 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Subject 0 1 3504.50 2764.82 2 1592.5 3137 4668.5 12140 ▇▇▃▁▂
Education05 0 1 13.91 2.50 6 12.0 13 16.0 20 ▁▁▇▃▂
Income87 0 1 13398.80 11599.62 -2 4500.0 12000 19000.0 59387 ▇▆▂▁▁
Income05 0 1 49414.80 46287.60 63 22650.0 38500 61350.0 703637 ▇▁▁▁▁
Weight05 0 1 183.10 41.71 81 150.0 180 209.0 380 ▂▇▃▁▁
HeightFeet05 0 1 5.18 0.57 -4 5.0 5 5.0 8 ▁▁▁▇▂
HeightInch05 0 1 5.32 3.51 0 2.0 5 8.0 11 ▇▅▅▅▇
Imagazine 0 1 0.72 0.45 0 0.0 1 1.0 1 ▃▁▁▁▇
Inewspaper 0 1 0.86 0.35 0 1.0 1 1.0 1 ▁▁▁▁▇
Ilibrary 0 1 0.77 0.42 0 1.0 1 1.0 1 ▂▁▁▁▇
MotherEd 0 1 11.70 2.61 0 11.0 12 12.0 20 ▁▁▇▂▁
FatherEd 0 1 11.85 3.56 0 10.0 12 14.0 20 ▁▂▇▃▁
FamilyIncome78 0 1 21251.67 13878.56 0 11167.0 20000 27500.0 75001 ▇▇▂▁▁
Science 0 1 16.29 4.77 0 13.0 17 20.0 25 ▁▂▆▇▅
Arith 0 1 18.59 7.15 0 13.0 19 25.0 30 ▁▆▇▇▇
Word 0 1 26.61 7.02 0 23.0 28 32.0 35 ▁▁▂▅▇
Parag 0 1 11.23 3.15 0 10.0 12 14.0 15 ▁▂▂▆▇
Number 0 1 35.46 10.11 0 29.0 36 44.0 50 ▁▂▅▇▇
Coding 0 1 47.05 15.24 0 38.0 48 57.0 84 ▁▃▇▇▂
Auto 0 1 14.26 5.32 0 10.0 14 18.0 25 ▁▆▇▆▅
Math 0 1 14.26 6.29 0 9.0 14 20.0 25 ▂▇▇▆▇
Mechanic 0 1 14.43 5.10 0 11.0 14 18.0 25 ▁▅▇▆▃
Elec 0 1 11.59 4.09 0 9.0 12 15.0 20 ▁▅▇▇▃
AFQT 0 1 54.68 27.70 0 31.9 57 78.2 100 ▅▆▇▇▇
Esteem81_1 0 1 1.42 0.52 1 1.0 1 2.0 4 ▇▆▁▁▁
Esteem81_2 0 1 1.42 0.51 1 1.0 1 2.0 4 ▇▆▁▁▁
Esteem81_3 0 1 3.51 0.57 1 3.0 4 4.0 4 ▁▁▁▆▇
Esteem81_4 0 1 1.57 0.56 1 1.0 2 2.0 4 ▇▇▁▁▁
Esteem81_5 0 1 3.46 0.65 1 3.0 4 4.0 4 ▁▁▁▆▇
Esteem81_6 0 1 1.62 0.56 1 1.0 2 2.0 4 ▆▇▁▁▁
Esteem81_7 0 1 1.75 0.60 1 1.0 2 2.0 4 ▅▇▁▁▁
Esteem81_8 0 1 3.13 0.76 1 3.0 3 4.0 4 ▁▂▁▇▆
Esteem81_9 0 1 3.16 0.72 1 3.0 3 4.0 4 ▁▂▁▇▆
Esteem81_10 0 1 3.40 0.66 1 3.0 3 4.0 4 ▁▁▁▇▇
Esteem87_1 0 1 1.38 0.50 1 1.0 1 2.0 4 ▇▅▁▁▁
Esteem87_2 0 1 1.40 0.50 1 1.0 1 2.0 4 ▇▅▁▁▁
Esteem87_3 0 1 3.58 0.54 1 3.0 4 4.0 4 ▁▁▁▅▇
Esteem87_4 0 1 1.50 0.53 1 1.0 1 2.0 4 ▇▇▁▁▁
Esteem87_5 0 1 3.53 0.60 1 3.0 4 4.0 4 ▁▁▁▆▇
Esteem87_6 0 1 1.59 0.56 1 1.0 2 2.0 4 ▆▇▁▁▁
Esteem87_7 0 1 1.72 0.58 1 1.0 2 2.0 4 ▅▇▁▁▁
Esteem87_8 0 1 3.10 0.74 1 3.0 3 4.0 4 ▁▃▁▇▅
Esteem87_9 0 1 3.06 0.74 1 3.0 3 4.0 4 ▁▃▁▇▅
Esteem87_10 0 1 3.37 0.65 1 3.0 3 4.0 4 ▁▁▁▇▇
#names(esteem)
# temp <- read.csv('data/NLSY79.csv', header = T, stringsAsFactors = F)
# # missing values? real variables vs. factors? are varable values reasonable?
#str(esteem)
#summary(esteem)
#levels(as.factor(esteem$Job05))
#table(as.factor(esteem$Job05))

The dataset has 2431 subjects and 46 variables of two types, numeric and character. In some cases, variables that ought to be coded as categorical are coded as numeric. For example, Imagazine, Inewspaper, and Ilibrary take yes/no values, and the esteem variables are ordinal with categories strongly agree, agree, disagree, and strongly disagree. Some numeric values also look peculiar: HeightFeet05 has a minimum of -4 and Income87 a minimum of -2, which are not possible values.

Further, we see that Job05 has 56 missing values.

Self esteem evaluation

Let us concentrate on the Esteem scores evaluated in 1987.

  1. First do a quick summary over all the Esteem variables. Pay attention to missing values, any peculiar numbers etc. How do you fix problems discovered if there is any? Briefly describe what you have done for the data preparation.
esteem %>% 
  skim(Esteem87_1:Esteem87_10)
Data summary
Name Piped data
Number of rows 2431
Number of columns 46
_______________________
Column type frequency:
numeric 10
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Esteem87_1 0 1 1.38 0.50 1 1 1 2 4 ▇▅▁▁▁
Esteem87_2 0 1 1.40 0.50 1 1 1 2 4 ▇▅▁▁▁
Esteem87_3 0 1 3.58 0.54 1 3 4 4 4 ▁▁▁▅▇
Esteem87_4 0 1 1.50 0.53 1 1 1 2 4 ▇▇▁▁▁
Esteem87_5 0 1 3.53 0.60 1 3 4 4 4 ▁▁▁▆▇
Esteem87_6 0 1 1.59 0.56 1 1 2 2 4 ▆▇▁▁▁
Esteem87_7 0 1 1.72 0.58 1 1 2 2 4 ▅▇▁▁▁
Esteem87_8 0 1 3.10 0.74 1 3 3 4 4 ▁▃▁▇▅
Esteem87_9 0 1 3.06 0.74 1 3 3 4 4 ▁▃▁▇▅
Esteem87_10 0 1 3.37 0.65 1 3 3 4 4 ▁▁▁▇▇

I use skim to summarize all the Esteem variables from the 1987 round. There are no missing values in these variables.

Looking at the means of these variables, we see that Esteem87_1, Esteem87_2, Esteem87_4, Esteem87_6, and Esteem87_7 have low means whereas the remaining variables have higher means. Reviewing the question labels above explains this pattern: the low-mean questions describe positive qualities, so respondents tend to answer in agreement (lower numbers), whereas the high-mean questions describe negative qualities, so respondents tend to answer in disagreement (higher numbers). The two groups of questions therefore point in opposite directions. One way to fix this is to reverse the scores of one set of variables so that all ten point in the same direction, which is exactly what the next question asks us to do.

  1. Reverse Esteem 1, 2, 4, 6, and 7 so that a higher score corresponds to higher self-esteem. (Hint: if we store the esteem data in data.esteem, then data.esteem[, c(1, 2, 4, 6, 7)] <- 5 - data.esteem[, c(1, 2, 4, 6, 7)] to reverse the score.)

To reverse the esteem scores, I use the following code:

#esteem[, c(1, 2, 4, 6, 7)] <- 5 - esteem[, c(1, 2, 4, 6, 7)]
esteem.rev <- esteem %>% mutate(
          Esteem87_1 = 5 - Esteem87_1,
          Esteem87_2 = 5 - Esteem87_2,
          Esteem87_4 = 5 - Esteem87_4,
          Esteem87_6 = 5 - Esteem87_6,
          Esteem87_7 = 5 - Esteem87_7)
esteem.rev %>% 
  skim(Esteem87_1:Esteem87_10)
Data summary
Name Piped data
Number of rows 2431
Number of columns 46
_______________________
Column type frequency:
numeric 10
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Esteem87_1 0 1 3.62 0.50 1 3 4 4 4 ▁▁▁▅▇
Esteem87_2 0 1 3.60 0.50 1 3 4 4 4 ▁▁▁▅▇
Esteem87_3 0 1 3.58 0.54 1 3 4 4 4 ▁▁▁▅▇
Esteem87_4 0 1 3.50 0.53 1 3 4 4 4 ▁▁▁▇▇
Esteem87_5 0 1 3.53 0.60 1 3 4 4 4 ▁▁▁▆▇
Esteem87_6 0 1 3.41 0.56 1 3 3 4 4 ▁▁▁▇▆
Esteem87_7 0 1 3.28 0.58 1 3 3 4 4 ▁▁▁▇▅
Esteem87_8 0 1 3.10 0.74 1 3 3 4 4 ▁▃▁▇▅
Esteem87_9 0 1 3.06 0.74 1 3 3 4 4 ▁▃▁▇▅
Esteem87_10 0 1 3.37 0.65 1 3 3 4 4 ▁▁▁▇▇

Note that the reversal aligns the two categories of variables. For the first category, which consists of questions describing positive qualities, the reversal changes the coding to 1: strongly disagree, 2: disagree, 3: agree, 4: strongly agree, so that a higher number corresponds to higher self-esteem.

The second category, questions describing negative qualities, keeps its existing coding, since disagreement with a negative quality (a higher number) already indicates higher self-esteem.

  1. Write a brief summary with necessary plots about the 10 esteem measurements.

I present separate summaries for the two variable categories discussed above. From the graphs, it can be seen that most respondents report high self-esteem in both categories. There are some exceptions, though: for instance, a substantial fraction of respondents wished they could have more self-respect (Esteem87_8), while several others reported that they feel useless at times (Esteem87_9).

# First I summarize Esteem87_1, Esteem87_2, Esteem87_4, Esteem87_6, and Esteem87_7.
esteem.rev %>%
  select(Esteem87_1, Esteem87_2, Esteem87_4,
         Esteem87_6:Esteem87_7)%>%
  pivot_longer(
    cols = c(Esteem87_1, Esteem87_2, Esteem87_4,
             Esteem87_6:Esteem87_7),
               names_to = 'variable',values_to = 'value') %>%
  mutate(variable = str_to_title(variable)) %>%
  ggplot() +
  geom_bar(mapping=aes(y=value)) +
  labs(y='',
       x='Count') +
  facet_wrap(~variable, scales = 'free',nrow = 3) +
  theme_minimal()

# Next I summarize Esteem87_3, Esteem87_5, and Esteem87_8 through Esteem87_10.
esteem.rev %>%
  select(Esteem87_3,Esteem87_5, Esteem87_8:Esteem87_10)%>%
  pivot_longer(
    cols = everything(),
               names_to = 'variable',values_to = 'value') %>%
  mutate(variable = str_to_title(variable)) %>%
  ggplot() +
  geom_bar(mapping=aes(y=value)) +
  labs(y='',
       x='Count') +
  facet_wrap(~variable, scales = 'free',nrow = 3) +
  theme_minimal()

  1. Are the esteem scores all positively correlated? Report the pairwise correlation table and write a brief summary.

Looking at the correlation table, the first thing to note is that all esteem variables are positively correlated. Furthermore, respondents’ sense of self-worth (Esteem87_1) is fairly strongly correlated with having a number of good qualities (Esteem87_2, r = 0.70), and taking a positive attitude towards oneself (Esteem87_6) is fairly strongly correlated with personal satisfaction (Esteem87_7, r = 0.60).

# correlation for all variables
esteem.pca <- esteem.rev %>%
  select(Esteem87_1:Esteem87_10)

corr <-round(cor(esteem.pca),
  digits = 2 # rounded to 2 decimals
)

corr 
##             Esteem87_1 Esteem87_2 Esteem87_3 Esteem87_4 Esteem87_5 Esteem87_6
## Esteem87_1        1.00       0.70       0.45       0.53       0.40       0.46
## Esteem87_2        0.70       1.00       0.44       0.55       0.40       0.48
## Esteem87_3        0.45       0.44       1.00       0.41       0.55       0.41
## Esteem87_4        0.53       0.55       0.41       1.00       0.38       0.51
## Esteem87_5        0.40       0.40       0.55       0.38       1.00       0.40
## Esteem87_6        0.46       0.48       0.41       0.51       0.40       1.00
## Esteem87_7        0.38       0.41       0.34       0.42       0.37       0.60
## Esteem87_8        0.27       0.28       0.35       0.30       0.38       0.41
## Esteem87_9        0.24       0.26       0.35       0.29       0.35       0.36
## Esteem87_10       0.31       0.33       0.46       0.37       0.44       0.44
##             Esteem87_7 Esteem87_8 Esteem87_9 Esteem87_10
## Esteem87_1        0.38       0.27       0.24        0.31
## Esteem87_2        0.41       0.28       0.26        0.33
## Esteem87_3        0.34       0.35       0.35        0.46
## Esteem87_4        0.42       0.30       0.29        0.37
## Esteem87_5        0.37       0.38       0.35        0.44
## Esteem87_6        0.60       0.41       0.36        0.44
## Esteem87_7        1.00       0.39       0.35        0.39
## Esteem87_8        0.39       1.00       0.43        0.44
## Esteem87_9        0.35       0.43       1.00        0.58
## Esteem87_10       0.39       0.44       0.58        1.00
  1. PCA on 10 esteem measurements. (centered but no scaling)

After these preliminary checks, we now perform a PCA on the Esteem variables. Since the values all share the same 1-to-4 scale, there is no need to scale; however, we center each Esteem variable.

esteem.pca <- esteem.rev %>% select(Esteem87_1:Esteem87_10)
pca_esteem87 <- prcomp(esteem.pca, center = TRUE, scale. = FALSE)  # prcomp defaults: center = TRUE, scale. = FALSE
names(pca_esteem87) #check output 
## [1] "sdev"     "rotation" "center"   "scale"    "x"
# rotation: loadings
# x: PC scores
# sdev: standard dev of the PC scores pc.4$sdev
# pc.4$center
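Since SVD is one of the objectives, it is worth noting that prcomp is computed via the singular value decomposition of the centered data matrix: the right singular vectors are the loadings, and the singular values divided by \(\sqrt{n-1}\) are the standard deviations of the PC scores. A quick sketch, assuming `esteem.pca` and `pca_esteem87` from above are in the workspace:

```r
# PCA via SVD of the centered data matrix
X  <- scale(esteem.pca, center = TRUE, scale = FALSE)  # center only
sv <- svd(X)
# right singular vectors = PC loadings (up to sign flips)
all.equal(abs(unname(sv$v)), abs(unname(pca_esteem87$rotation)))
# singular values / sqrt(n - 1) = standard deviations of the PC scores
all.equal(sv$d / sqrt(nrow(X) - 1), pca_esteem87$sdev)
```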

The table below reports the loadings of all the principal components.

pca_esteem87.loading <- pca_esteem87$rotation
knitr::kable(pca_esteem87.loading)
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Esteem87_1 0.235 -0.374 0.057 -0.012 0.393 -0.010 0.201 -0.352 -0.066 -0.691
Esteem87_2 0.244 -0.367 0.062 0.028 0.377 -0.009 0.151 -0.342 -0.006 0.720
Esteem87_3 0.279 -0.149 0.121 -0.436 -0.078 -0.012 0.595 0.568 0.105 0.023
Esteem87_4 0.261 -0.321 0.067 0.135 0.264 -0.124 -0.608 0.488 0.334 -0.040
Esteem87_5 0.312 -0.131 0.061 -0.592 -0.361 0.404 -0.407 -0.257 -0.074 0.002
Esteem87_6 0.313 -0.209 -0.071 0.357 -0.248 -0.032 -0.034 0.225 -0.782 0.016
Esteem87_7 0.299 -0.163 -0.122 0.497 -0.512 0.205 0.203 -0.153 0.502 -0.034
Esteem87_8 0.393 0.332 -0.821 -0.116 0.211 -0.047 -0.003 0.006 0.026 0.003
Esteem87_9 0.398 0.578 0.442 0.211 0.282 0.428 0.023 0.054 -0.032 -0.008
Esteem87_10 0.376 0.260 0.284 -0.084 -0.222 -0.770 -0.060 -0.234 0.054 -0.007
  1. Report the PC1 and PC2 loadings. Are they unit vectors? Are they orthogonal?

The table below reports the loadings of PC1 and PC2; each is a unit vector, and the two are orthogonal to each other.

# Loadings
pca_loadings <- pca_esteem87$rotation

# Prepare table with loadings for PC1 and PC2
tab_loadings <- cbind( c( 1:10 ), pca_loadings[ , 1:2 ] )
knitr::kable( tab_loadings, 
              col.names = c( 'Esteem Type', 'PC1', 'PC2' ),
              caption = 'PC1 and PC2 loadings' ) %>%
  kableExtra::kable_styling()
PC1 and PC2 loadings
Esteem Type PC1 PC2
Esteem87_1 1 0.235 -0.374
Esteem87_2 2 0.244 -0.367
Esteem87_3 3 0.279 -0.149
Esteem87_4 4 0.261 -0.321
Esteem87_5 5 0.312 -0.131
Esteem87_6 6 0.313 -0.209
Esteem87_7 7 0.299 -0.163
Esteem87_8 8 0.393 0.332
Esteem87_9 9 0.398 0.578
Esteem87_10 10 0.376 0.260
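The unit-length and orthogonality claims can be checked numerically (a quick sketch, assuming the `pca_esteem87` object fitted above is in the workspace):

```r
# Cross-product of the first two loading columns: t(V) %*% V should be the
# 2x2 identity, i.e. diagonal 1's (unit length) and off-diagonal 0's
# (orthogonality), up to floating-point rounding.
V <- pca_esteem87$rotation[, 1:2]
round(crossprod(V), 10)
```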
b) Are there good interpretations for PC1 and PC2? (If loadings are all negative, take the positive loadings for the ease of interpretation)

From the loading values in the table, PC1 is roughly a (weighted) average of all ten self-esteem scores, while PC2 contrasts the questions related to self-respect (Esteem questions 8, 9 and 10) with those related to self-perception (questions 1-7).

c) How is the PC1 score obtained for each subject? Write down the formula.

We can compute the PC1 score of subject \(n\) as the inner product of the PC1 loading with the subject’s centered esteem scores: \[PC1_n = 0.235(E1_n-\bar{E1})+0.244(E2_n-\bar{E2})+0.279(E3_n-\bar{E3})+0.261(E4_n-\bar{E4})+0.312(E5_n-\bar{E5})+0.313(E6_n-\bar{E6})+0.299(E7_n-\bar{E7})+0.393(E8_n-\bar{E8})+0.398(E9_n-\bar{E9})+0.376(E10_n-\bar{E10}),\] where \(\bar{Ek}\) denotes the sample mean of question \(k\) (prcomp centers each variable before projecting).
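As a check, the PC1 scores can be recomputed by hand (a sketch, assuming `esteem.pca` and `pca_esteem87` from the chunks above are in the workspace):

```r
# PC1 score = inner product of the PC1 loading with the *centered* esteem
# scores; this should match the scores prcomp stores in $x.
centered   <- sweep(as.matrix(esteem.pca), 2, colMeans(esteem.pca))
pc1_manual <- centered %*% pca_esteem87$rotation[, 1]
all.equal(as.numeric(pc1_manual), as.numeric(pca_esteem87$x[, 1]))
```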

d) Are PC1 scores and PC2 scores in the data uncorrelated? 
   

The table below shows that PC1 and PC2 are uncorrelated:

# Check that PCs are uncorrelated
cor(pca_esteem87$x ) %>% round( 0 )
##      PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
## PC1    1   0   0   0   0   0   0   0   0    0
## PC2    0   1   0   0   0   0   0   0   0    0
## PC3    0   0   1   0   0   0   0   0   0    0
## PC4    0   0   0   1   0   0   0   0   0    0
## PC5    0   0   0   0   1   0   0   0   0    0
## PC6    0   0   0   0   0   1   0   0   0    0
## PC7    0   0   0   0   0   0   1   0   0    0
## PC8    0   0   0   0   0   0   0   1   0    0
## PC9    0   0   0   0   0   0   0   0   1    0
## PC10   0   0   0   0   0   0   0   0   0    1
e) Plot PVE (Proportion of Variance Explained) and summarize the plot. 

Now, we check the proportion of variance explained by each PC using the Scree plot of PVE and the Cumulative Scree Plot of PVE below. We can see that the first two PC scores explain about 60% of the variance in the data.

summary(pca_esteem87)$importance 
##                          PC1   PC2    PC3   PC4    PC5    PC6   PC7    PC8
## Standard deviation     1.297 0.678 0.5717 0.520 0.4610 0.4330 0.375 0.3668
## Proportion of Variance 0.466 0.127 0.0905 0.075 0.0588 0.0519 0.039 0.0373
## Cumulative Proportion  0.466 0.593 0.6837 0.759 0.8175 0.8694 0.908 0.9457
##                           PC9   PC10
## Standard deviation     0.3499 0.2713
## Proportion of Variance 0.0339 0.0204
## Cumulative Proportion  0.9796 1.0000
plot(pca_esteem87)

plot(summary(pca_esteem87)$importance[2, ], # PVE
     ylab = "PVE",
     xlab = "Number of PCs",
     pch = 16,
     main = "Scree Plot of PVE for Esteem87")

f) Also plot CPVE (Cumulative Proportion of Variance Explained). What proportion of the variance in the data is explained by the first two principal components?
plot(summary(pca_esteem87)$importance[3, ], # cumulative PVE
     ylab = "Cumulative PVE",
     xlab = "Number of PCs",
     pch = 16,
     main = "Cumulative PVE for Esteem87")

g) PC’s provide us with a low dimensional view of the self-esteem scores. Use a biplot with the first two PC's to display the data.  Give an interpretation of PC1 and PC2 from the plot. (try `ggbiplot` if you could, much prettier!)

The biplot graph shown below enables us to display the PC1 vs PC2 for each subject. We can verify that PC2 has different signs for Esteem questions 1-7 and 8-10, whereas the sign is the same for all scores for PC1.

## biplot pcs
ggbiplot( pca_esteem87, 
          obs.scale = 1, 
          var.scale = 1, alpha = 0.3 ) +
  geom_hline( yintercept = 0, size = 0.5, color = 'blue', linetype = 'dotted' ) +
  geom_vline( xintercept = 0, size = 0.5, color = 'blue', linetype = 'dotted' ) +
  scale_color_discrete(name = '') +
  labs( title = 'PC1 vs PC2 by Subjects and Loadings of Esteem Questions' ) 

  1. Apply k-means to cluster subjects on the original esteem scores

After the Principal Component Analysis, we turn to a clustering analysis of the data. We first assess what an optimal number of clusters would be; using the elbow criterion, we cluster the data into two groups.

a) Find a reasonable number of clusters using the within-cluster sum of squares and the elbow rule.
esteem_org <- esteem.rev %>% select(Esteem87_1:Esteem87_10)  # use the reversed scores so all questions point in the same direction

set.seed(0)
# function to compute the total within-cluster sum of squares
wss <- function(df, k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}
# compute wss for k = 2 to k = 20 over all 10 esteem columns
k.values <- 2:20
wss_values <- sapply(k.values, function(k) wss(esteem_org, k))
# or, with purrr: wss_values <- map_dbl(k.values, ~ wss(esteem_org, .x))
plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-cluster sum of squares")

#set.seed(0)
fviz_nbclust(esteem_org, kmeans, method = "wss")

b) Can you summarize common features within each cluster?
# two clusters
clus_esteem87 <- kmeans(esteem_org, centers = 2 )

# get clusters to original data
esteem$clus <- clus_esteem87$cluster
# get pcs to original data
esteem$pc1_esteem87 <- pca_esteem87$x[ , 1]
esteem$pc2_esteem87 <- pca_esteem87$x[ , 2]

cluster_pcplot <- 
  ggplot( esteem ) +
  geom_hline( yintercept = 0, color = 'blue', size = 0.5, linetype = 'dotted' ) +
  geom_vline( xintercept = 0, color = 'blue', size = 0.5, linetype = 'dotted' ) +
  geom_point( aes( x = pc1_esteem87, y = pc2_esteem87, 
                   color = factor( clus ), 
                   shape = factor( clus ) ) ) +
  labs( title = 'PC1 vs PC2 by cluster', x = 'PC1', y = 'PC2' ) +
  scale_color_manual( name = '', 
                      values = c( 'black', 'red' ),
                      labels = c( 'Cluster 1', 'Cluster 2' ) ) +
  scale_shape_manual( name = '', 
                      values = c( 1, 19 ),
                      labels = c( 'Cluster 1', 'Cluster 2' ) ) +
  scale_x_continuous( breaks = seq( -5, 5, 1 ), limits = c( -4, 4 ) ) +
  scale_y_continuous( breaks = seq( -4, 4, 1 ), limits = c( -4, 4 ) ) 

print( cluster_pcplot )
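Beyond the plot, the cluster centers give a direct numeric summary of the common features within each cluster (a sketch, using the `clus_esteem87` fit from above):

```r
# Mean esteem score per question within each cluster; uniformly higher values
# across all ten questions mark the higher self-esteem cluster.
round(clus_esteem87$centers, 2)
# number of subjects in each cluster
clus_esteem87$size
```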

c) Can you visualize the clusters with somewhat clear boundaries? You may try different pairs of variables and different PC pairs of the esteem scores.

The plot of PC1 vs PC2 above shows a clear division between Cluster 1 and Cluster 2 in terms of their PC scores. The first PC score divides the sample into two groups: one (red) with PC1 greater than 0, and another with PC1 smaller than 0. We may therefore say that Cluster 2 is composed of subjects who reported higher self-esteem scores.

  1. We now try to find out what factors are related to self-esteem. PC1 of all the Esteem scores is a good variable to summarize one’s esteem scores, so we take PC1 as our response variable.

    1. Prepare possible factors/variables:
    • EDA the data set first.

    • Personal information: gender, education (05), log(income) in 87, job type in 87, and Weight05 (lb) together with HeightFeet05 and HeightInch05. One way to summarize one’s weight and height is the Body Mass Index (BMI), defined as body mass divided by the square of body height and universally expressed in units of kg/m². Note that you need to create BMI first; then you may include it as one possible predictor.

    • Household environment: Imagazine, Inewspaper, Ilibrary, MotherEd, FatherEd, FamilyIncome78. Do set indicators Imagazine, Inewspaper and Ilibrary as factors.

    • You may use PC1 of ASVAB as level of intelligence

To examine associations between overall self-esteem (summarized by the PC1 score) and other relevant variables, we begin the EDA by splitting the variables into two groups: personal characteristics and household environment. Next, we check basic associations between self-esteem and these variables using scatter plots, histograms, or boxplots.

# PCA asvab
asvab <- esteem %>% select( Science, Arith, Word, Parag, Number, Coding, Auto, Math, Mechanic, Elec)
pca_asvab <- prcomp(asvab, 
                   center = TRUE,
                   scale = TRUE)
esteem$pc1_asvab <- pca_asvab$x[ , 1 ]

#create BMI 
esteem <- esteem %>% mutate (
             BMI =  Weight05*0.454  / ( HeightFeet05*0.3048 + HeightInch05*0.0254)^2, 
             Male = factor( if_else( Gender == 'male', 1, 0 ) ),
             Esteem     = pc1_esteem87,
             ASVAB      = pc1_asvab,
             Imagazine  = factor(Imagazine), 
             Inewspaper = factor(Inewspaper), 
             Ilibrary   = factor(Ilibrary), 
             logFamIncome = log(FamilyIncome78)
)
# Personal characteristics
ggIncome <- 
  ggplot( esteem, aes( x = log(Income87), y = Esteem ) ) +
  geom_point( ) +
  geom_smooth( method = 'lm', se = F ) +
  scale_x_continuous( breaks = seq( -10, 15, 1 ),
                       name = 'log(Income in 1987)' ) +
  scale_y_continuous( breaks = seq( -5, 5, 1 ) ) +
  labs( title = 'Income (1987) vs Esteem PC1' ) 
 

ggEduc <- 
  ggplot( esteem, aes( x = Education05, y = Esteem ) ) +
  geom_point( ) +
  geom_smooth( method = 'lm', se = F ) +
  scale_x_continuous( breaks = seq( 0, 20, 2 ),
                      name = 'Years of schooling completed (2005)' ) +
  labs( title = 'Education vs Esteem PC1' ) +
  scale_y_continuous( breaks = seq( -5, 5, 1 ) ) 

ggIntel <- 
  ggplot( esteem, aes( x = ASVAB, y = Esteem ) ) +
  geom_point( ) +
  geom_smooth( method = 'lm', se = F ) +
  scale_x_continuous( breaks = seq( -10, 10, 2 ),
                      name = 'ASVAB score (PC1)' ) +
  labs( title = 'ASVAB score (PC1) vs Esteem PC1' ) +
  scale_y_continuous( breaks = seq( -5, 5, 1 ) ) 

ggBMI <- 
  ggplot(data = esteem %>% filter( BMI < 50 ),aes( x = BMI, y = Esteem ) ) +
  geom_point( ) +
  geom_smooth( method = 'lm', se = F ) +
  scale_x_continuous( breaks = seq( 10, 200, 5 ),
                      name = 'BMI (kg/m^2)' ) +
  labs( title = 'BMI vs Esteem PC1' ) +
  scale_y_continuous( breaks = seq( -5, 5, 1 ) ) 

ggGender <- 
  ggplot(esteem) +
  geom_boxplot( aes( x = factor( Male ), y = Esteem ) ) +
  scale_x_discrete( labels = c( 'Female', 'Male' ), name = '' ) +
  labs( title = 'Gender vs Esteem PC1' ) 


print(ggIncome)

print(ggEduc)

print(ggIntel)

print(ggBMI)

print(ggGender)

ggMagaz <- 
  ggplot(esteem) +
  geom_boxplot( aes( x = factor( Imagazine ), y = Esteem ) ) +
  scale_x_discrete( labels = c( 'No', 'Yes' ),
                    name = 'At least a member of household read magazines in 1979' ) +
  labs( title = 'Magazine reader in the household vs Esteem PC1' ) 

ggNewsp <- 
  ggplot(esteem) +
  geom_boxplot( aes( x = factor( Inewspaper ), y = Esteem ) ) +
  scale_x_discrete( labels = c( 'No', 'Yes' ),
                    name = 'At least a member of household read newspapers in 1979') +
  labs( title = 'Newspaper reader in the household vs Esteem PC1' ) 

ggLib <- 
  ggplot(esteem) +
  geom_boxplot( aes( x = factor( Ilibrary ), y = Esteem ) ) +
  scale_x_discrete( labels = c( 'No', 'Yes' ),
                    name = 'At least a member of household with a library card in 1979' ) +
  labs( title = 'Library card holder in the household vs Esteem PC1' ) 


ggParentsEd <- esteem %>%
  select(MotherEd, FatherEd)%>%
  pivot_longer(
    cols = c(MotherEd, FatherEd),
               names_to = 'variable',values_to = 'value') %>%
  mutate(variable = str_to_title(variable)) %>%
  ggplot() +
  geom_bar(mapping=aes(y=value)) +
  labs(y='',
       x='Count') +
  facet_wrap(~variable, scales = 'free',nrow = 3) +
  theme_minimal()

ggFamIncome <- 
  ggplot( esteem, aes( x = logFamIncome, y = Esteem ) ) +
  geom_point( ) +
  geom_smooth( method = 'lm', se = F ) +
  scale_x_continuous( breaks = seq( -10, 15, 1 ),
                       name = 'log(Family Income in 1979)' ) +
  scale_y_continuous( breaks = seq( -5, 5, 1 ) ) +
  labs( title = 'Family Income vs Esteem PC1' ) 


print(ggMagaz)

print(ggNewsp)

print(ggLib)

print(ggParentsEd)

print(ggFamIncome)

Based on the EDA, we run four regression models with esteem (PC1) as the response variable. The model specifications are provided below:

  • Model 1: Esteem vs Personal characteristics (Gender + BMI);
  • Model 2: Esteem vs Personal characteristics + Education and Intelligence (Education, ASVAB score);
  • Model 3: Esteem vs Personal characteristics + Education and Intelligence + Income (Income in 87);
  • Model 4: Esteem vs Personal characteristics + Education and Intelligence + Income + Household Environment (Imagazine, Inewspaper, Ilibrary, MotherEd, and FatherEd).

We decided not to include the Family Income variable due to its likely collinearity with parental education. Further, based on the scatterplot of self-esteem and BMI, we restrict BMI to values below 50 in order to remove a few outliers.

b)   Run a few regression models between PC1 of all the esteem scores and suitable variables listed in a). Find a final best model with your own criterion. 
# Model 1: personal characteristics (sex + BMI)
model1 <- 
  lm( data = esteem %>% filter( BMI < 50 ),
      Esteem ~ 
        Male + BMI )        
# Model 2: personal characteristics + education and intelligence (+ Education05 + ASVAB)
model2 <- 
  lm( data = esteem %>% filter( BMI < 50 ),
      Esteem ~ 
        Male + BMI +
        Education05 + ASVAB) 

# Model 3: personal characteristics + education and intelligence + income (+ Income87)
model3 <-
  lm( data = esteem %>% filter( BMI < 50 ),
      Esteem ~ 
        Male + BMI +
        Education05 + ASVAB+ Income87)

# Model 4: personal characteristics + education and intelligence + income + household environment (+ Imagazine + Inewspaper + Ilibrary + MotherEd + FatherEd)
model4 <-
  lm(data=esteem %>% filter( BMI < 50 ),
     Esteem ~ 
       Male + BMI +
       Education05 + ASVAB +
       Imagazine +  Inewspaper + Ilibrary + FatherEd + MotherEd+ Income87)

stargazer::stargazer( 
  model1, model2, model3, model4,
  #type = output_format, 
  # ci=TRUE, ci.level = .95,
  keep.stat = c( "n", "rsq","sigma2", "ser" ) )
  - How did you land this model? Run a model diagnosis to see if the linear model assumptions are reasonably met. 
par(mfrow=c(1,2))
plot(model1, 1)
plot(model1, 2)

par(mfrow=c(1,2))
plot(model2, 1)
plot(model2, 2)

par(mfrow=c(1,2))
plot(model3, 1)
plot(model3, 2)

par(mfrow=c(1,2))
plot(model4, 1)
plot(model4, 2)

par( mfrow = c( 1, 2 ), 
     mar   = c( 5, 2, 4, 2 ), 
     mgp   = c( 3, 0.5, 0 ) ) # plot(model4) produces several diagnostic plots
plot( model4, 1, pch = 16) # residual plot
abline( h = 0, col = "blue", lwd = 2 )
plot( model4, 2 ) # qqplot

  - Write a summary of your findings. In particular, explain what and how the variables in the model affect one's self-esteem. 
    

Looking at the regression table in part (b) we see several things:

  • In model 1, which includes only BMI and gender, we see that being male and having a higher BMI are both significantly positively associated with reported self-esteem.

  • In model 2, adding education and intelligence renders BMI insignificant, while being male, education, and intelligence are all significantly positively associated with self-esteem.

  • In model 3, we also include the respondent's income in 1987, and find that being male no longer has a significant association.

  • In model 4, we also include variables for family characteristics. We see that an individual’s level of education, intelligence, income, and household access to newspapers are significantly positively associated with their self-esteem evaluation. In particular, each extra year of schooling increases the PC1 score by 0.067 units, and a one-point increase in the ASVAB score increases it by 0.073 units. Further, controlling for all other variables, individuals in households with access to newspapers were likely to have a PC1 score 0.15 units higher than those in households without. The magnitude of the income effect on PC1 of self-esteem is very small.

2 Case study 2: Breast cancer sub-type

The Cancer Genome Atlas (TCGA), a landmark cancer genomics program by National Cancer Institute (NCI), molecularly characterized over 20,000 primary cancer and matched normal samples spanning 33 cancer types. The genome data is open to public from the Genomic Data Commons Data Portal (GDC).

In this study, we focus on 4 sub-types of breast cancer (BRCA): basal-like (basal), Luminal A-like (lumA), Luminal B-like (lumB), HER2-enriched. The sub-type is based on PAM50, a clinical-grade luminal-basal classifier.

  • Luminal A cancers are low-grade, tend to grow slowly and have the best prognosis.
  • Luminal B cancers generally grow slightly faster than luminal A cancers and their prognosis is slightly worse.
  • HER2-enriched cancers tend to grow faster than luminal cancers and can have a worse prognosis, but they are often successfully treated with targeted therapies aimed at the HER2 protein.
  • Basal-like breast cancers or triple negative breast cancers do not have the three receptors that the other sub-types have so have fewer treatment options.

We will try to use mRNA expression data alone, without the labels, to classify the 4 sub-types. Classification without labels, or prediction without outcomes, is called unsupervised learning. We will use K-means and spectral clustering to cluster the mRNA data and see whether the sub-types can be separated through mRNA data.

We first read the data using data.table::fread() which is a faster way to read in big data than read.csv().

brca <- fread(here('data','brca_subtype.csv'))

# get the sub-type information
brca_subtype <- brca$BRCA_Subtype_PAM50
brca <- brca[,-1]
num_gene <- ncol(brca)
dim(brca)
## [1]  1160 19947
  1. Summary and transformation

    1. How many patients are there in each sub-type?

Answer: There are 1160 patients in total: 208 Basal, 91 Her2, 628 LumA, and 233 LumB (obtained via `table(brca_subtype)`; these match the row sums of the discrepancy table below).
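The per-sub-type counts come directly from `table()` on the label vector. A minimal sketch on a toy label vector (hypothetical values; the real call would be `table(brca_subtype)`):

```r
# Toy label vector standing in for brca_subtype (hypothetical values)
subtype <- factor(c("Basal", "LumA", "LumA", "LumB", "Her2", "LumA"))

# One count per sub-type, analogous to table(brca_subtype)
counts <- table(subtype)
counts
```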

b) Randomly pick 5 genes and plot the histogram by each sub-type.
# randomly select 5 genes
set.seed(10)
sample_idx <- sample(num_gene, 5)  

brca %>% 
  select(all_of(sample_idx)) %>%      # select column by index
  pivot_longer(cols = everything()) %>%     # long format for facet_wrap()
  ggplot(aes(x = value, y = ..density..)) +
  geom_histogram(aes(fill = name)) +
  facet_wrap(~name, scales = "free") +
  theme_bw() +
  theme(legend.position = "none")

c) Remove genes with zero counts and no variability. Then apply a logarithmic transform.
# remove genes with 0 counts
sel_cols <- which(colSums(abs(brca)) != 0)
brca_sub <- brca[, sel_cols, with=F]
dim(brca_sub)
## [1]  1160 19669
# log to make it spread out
brca_sub <- log2(as.matrix(brca_sub+1e-10)) 
  1. Apply kmeans on the transformed dataset with 4 centers and output the discrepancy table between the real sub-type brca_subtype and the cluster labels.
system.time({brca_sub_kmeans <- kmeans(x = brca_sub, 4)})
##    user  system elapsed 
##  11.288   0.187  11.479
# save the results as RDS
saveRDS(brca_sub_kmeans, "brca_kmeans.RDS")
brca_sub_kmeans <- readRDS("brca_kmeans.RDS")
# discrepancy table 

table(brca_subtype, brca_sub_kmeans$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal   1  17   3 187
##        Her2   39   9  26  17
##        LumA  392  82 154   0
##        LumB   98  22 111   2
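One way to summarize such a discrepancy table in a single number is to assign each cluster its majority sub-type and compute the resulting agreement rate. A self-contained sketch on toy one-dimensional data (not the homework data):

```r
set.seed(1)
# Toy data with two well-separated groups (stand-in for the mRNA matrix)
truth   <- factor(rep(c("A", "B"), each = 50))
values  <- c(rnorm(50, mean = 0), rnorm(50, mean = 5))
cluster <- kmeans(values, centers = 2)$cluster

# Discrepancy table, then agreement under a majority-label assignment
tab <- table(truth, cluster)
agreement <- sum(apply(tab, 2, max)) / sum(tab)
agreement
```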
  1. Spectral clustering: to scale or not to scale?

    1. Apply PCA on the centered and scaled dataset. How many PCs should we use and why? You are encouraged to use irlba::irlba().

Answer: 4 PCs should be used, based on the elbow rule: increasing the number of PCs further adds little information (each additional PC improves the PVE by less than 0.03).

pca_ret <- prcomp(brca_sub, center = T, scale. = T)

# save the result
# only save a few leading PCs
pca_ret$rotation <- pca_ret$rotation[, 1:20]   
pca_ret$x <- pca_ret$x[, 1:20]
saveRDS(pca_ret, here('data',"brca_pca.RDS"))

pca_ret <- readRDS(here('data',"brca_pca.RDS"))

# Plot scree plot of PVE
pve <- summary(pca_ret)$importance[2, 1:10]
plot(pve, type="b", pch = 19, frame = FALSE)

b) Plot PC1 vs PC2 of the centered and scaled data and PC1 vs PC2 of the centered but unscaled data side by side. Should we scale or not scale for the clustering process? Why? (Hint: to put plots side by side, use `gridExtra::grid.arrange()` or `ggpubr::ggarrange()` or `egg::ggarrange()` for ggplots; use `fig.show="hold"` as a chunk option for base plots)

Answer: clustering based on the unscaled data is preferred. Scaling gives every gene equal weight, so low-expression, noisy genes contribute as much to the leading principal components as the highly variable, informative ones. The unscaled data preserve the original variance structure and help produce more distinguishable clusters.
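The effect of scaling is easy to see on simulated data: when one variable has much larger variance, scaling equalizes the variables' influence on PC1. A self-contained sketch (toy data, not the mRNA matrix):

```r
set.seed(1)
# Two independent variables with very different variances
X <- cbind(big = rnorm(200, sd = 10), small = rnorm(200, sd = 1))

pc_unscaled <- prcomp(X, center = TRUE, scale. = FALSE)
pc_scaled   <- prcomp(X, center = TRUE, scale. = TRUE)

# Unscaled: PC1 is dominated by the high-variance column
round(abs(pc_unscaled$rotation[, "PC1"]), 3)
# Scaled: with two variables, both load exactly 1/sqrt(2) on PC1
round(abs(pc_scaled$rotation[, "PC1"]), 3)
```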

# center and scale the data
brca_sub_scaled_centered <- scale(as.matrix(brca_sub), center = T, scale = T)

svd_ret <- irlba::irlba(brca_sub_scaled_centered, nv = 2)

pc_score <- svd_ret$u[, 1:2] %*% diag(svd_ret$d[1:2])  # PC scores = U D (elementwise u*d would recycle d incorrectly)

# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)

p1 <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl)) +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
                     values = c(4, 8, 10, 16)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2")

brca_sub_unscaled_centered <- scale(as.matrix(brca_sub), center = T, scale = F)

svd_ret <- irlba::irlba(brca_sub_unscaled_centered, nv = 2)

pc_score <- svd_ret$u[, 1:2] %*% diag(svd_ret$d[1:2])  # PC scores = U D

# apply kmean
kmean_ret <- kmeans(x = pc_score, 4)

p2 <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl)) +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
                     values = c(4, 8, 10, 16)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2")

ggpubr::ggarrange(p1, p2, ncol = 2, nrow = 1)

  1. Spectral clustering: center but do not scale the data

    1. Use the first 4 PCs of the centered and unscaled data and apply kmeans. Find a reasonable number of clusters using the total within-cluster sum of squares with the elbow rule.

    Based on the elbow rule, we would choose 5 as a reasonable number of clusters.

svd_ret <- irlba::irlba(brca_sub_unscaled_centered, nv = 4)

pc_score <- svd_ret$u[, 1:4] %*% diag(svd_ret$d[1:4])  # PC scores = U D

set.seed(10)

# function to compute total within-cluster sum of square 
wss <- function(df, k) {
  kmeans(df, k, nstart = 10)$tot.withinss
}


k.values <- 2:15

# the elbow plot should be based on the 4 PC scores (per the question), via the wss() helper above
wss_values <- sapply(k.values, function(k) wss(pc_score, k))

plot(k.values, wss_values,
       type="b", pch = 19, frame = FALSE, 
       xlab="Number of clusters K",
       ylab="Total within-clusters sum of squares")

b) Choose an optimal cluster number and apply kmeans. Compare the real sub-type and the clustering label as follows: Plot a scatter plot of PC1 vs PC2. Use point color to indicate the true cancer type and point shape to indicate the clustering label. Plot the kmeans centroids with black dots. Summarize how good the clustering results are compared to the real sub-types.

Answer: Even at the optimal cluster number, the clustering does not recover the real sub-types well. Although clusters 1 and 3 largely identify "Basal", clusters 4 and 5 contain a mix of LumA and LumB.

# apply kmean
kmean_ret <- kmeans(x = pc_score, 5)

# Extract the centroids
centroids <- data.frame(x = kmean_ret$centers[,1], y = kmean_ret$centers[,2])


p <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl))  +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4", "Cluster 5"),
                     values = c(4, 8, 10, 17, 20)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2")
p + geom_point(data = centroids, aes(x = x, y = y), color = "black", size = 3, shape = 16)

c) Compare the clustering result from applying kmeans to the original data and the clustering result from applying kmeans to 4 PCs. Does PCA help in kmeans clustering? What might be the reasons if PCA helps?

The kmeans clustering using 4 PCs separates the four types almost as well as the kmeans clustering on the original data. (Note that the between-cluster sums of squares are not directly comparable across the two clusterings, since the total variance differs between the 4-PC space and the original space.) Pre-processing the data with PCA before cluster analysis reduces the dimension of the data, making it easier to analyze. As plotted in Q3 above, the proportion of variance explained (PVE) shows that 4 PCs already capture most of the variance in the original data. This signals that the 4 PCs retain the most important features of the data, which helps with the efficiency of the cluster analysis.

svd_ret <- irlba::irlba(brca_sub_unscaled_centered, nv = 4)

pc_score <- svd_ret$u[, 1:4] %*% diag(svd_ret$d[1:4])  # PC scores = U D
kmean_ret4 <- kmeans(x = pc_score, 4)

## based on the original data
table(brca_subtype, brca_sub_kmeans$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal   1  17   3 187
##        Her2   39   9  26  17
##        LumA  392  82 154   0
##        LumB   98  22 111   2
## based on the 4 PCs
table(brca_subtype, kmean_ret4$cluster)
##             
## brca_subtype   1   2   3   4
##        Basal   2 189   1  16
##        Her2   28  26  28   9
##        LumA  401   0 153  74
##        LumB   94   1 116  22
# between-cluster sums of squares (scalars, so no contingency table is needed)
c(original = brca_sub_kmeans$betweenss, four_PCs = kmean_ret4$betweenss)
##  original  four_PCs 
##  67530386  50176179
d) Now we have an x patient with breast cancer but with unknown sub-type. We have this patient's mRNA sequencing data. Project this x patient into the space of PC1 and PC2. (Hint: remember we removed some genes with no counts or no variability, took logs, and centered.) Plot this patient in the plot in iv) with a black dot. Calculate the Euclidean distance between this patient and each cluster centroid. Can you tell which sub-type this patient might have? 

The Euclidean distances between this patient and the four cluster centroids are 553, 500, 141, and 577, respectively, so the patient falls closest to cluster 3, the cluster dominated by Basal tumors. This patient likely has the basal-like sub-type.

x_patient <- fread(here('data','brca_x_patient.csv'))
new_df <- rbind(brca, x_patient)

sel_cols2 <- which(colSums(abs(new_df)) != 0)
x_patient_sub <- new_df[, sel_cols2, with=F]
dim(x_patient_sub)
## [1]  1161 19669
# log to make it spread out
x_patient_sub <- log2(as.matrix(x_patient_sub+1e-10)) 

patient_sub_centered <- scale(as.matrix(x_patient_sub), center = T, scale = F)
svd_ret2 <- irlba::irlba(patient_sub_centered, nv = 2)

pc_score <- patient_sub_centered %*% svd_ret2$v[, 1:2]
x_patient_pc <-pc_score[1161, ]
x_patient_pc <- data.frame(x = x_patient_pc[1], y = x_patient_pc[2])
 
# apply kmean 
kmean_ret <- kmeans(x = pc_score, 4)

p <- data.table(x = pc_score[,1], 
                y = pc_score[,2],
                col = as.factor(brca_subtype),
                cl = as.factor(kmean_ret$cluster)) %>%
  ggplot() + 
  geom_point(aes(x = x, y = y, col = col, shape = cl))  +
  scale_color_manual(labels = c("Basal", "Her2", "LumA", "LumB"),
                     values = scales::hue_pal()(4)) +
  scale_shape_manual(labels = c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4"),
                     values = c(4, 8, 10, 17)) + 
  theme_bw() +
  labs(color = "Cancer type", shape = "Cluster") +
  xlab("PC1") +
  ylab("PC2")
p + geom_point(data = x_patient_pc, aes(x = x, y = y), color = "black", size = 3, shape = 16)

# Calculate the Euclidean distances between items and the corresponding centroid
distance1 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[1,])^2))
distance2 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[2,])^2))
distance3 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[3,])^2))
distance4 <- sqrt(rowSums((x_patient_pc - kmean_ret$centers[4,])^2))
# View the distances
distance1;distance2; distance3;distance4
## [1] 553
## [1] 500
## [1] 141
## [1] 577
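The four distance computations above can be collapsed into one vectorized expression. A sketch with a hypothetical 2-D point and centroid matrix (standing in for `x_patient_pc` and `kmean_ret$centers`):

```r
# Hypothetical 2-D point and four cluster centroids
point     <- c(1, 2)
centroids <- rbind(c(0, 0), c(1, 2), c(3, 4), c(-1, -2))

# Euclidean distance from the point to every centroid at once
dists <- sqrt(rowSums(sweep(centroids, 2, point)^2))
dists
which.min(dists)   # index of the nearest centroid
```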

3 Case study 3: Auto data set (leave for next week)

This question utilizes the Auto dataset from ISLR. The original dataset contains 408 observations about cars. It is similar to the CARS dataset that we use in our lectures. To get the data, first install the package ISLR. The Auto dataset should be loaded automatically. We’ll use this dataset to practice the methods learned so far. The original data source is here: https://archive.ics.uci.edu/ml/datasets/auto+mpg

Get familiar with this dataset first. Tip: you can use the command ?ISLR::Auto to view a description of the dataset.

3.1 EDA

Explore the data, with particular focus on pairwise plots and summary statistics. Briefly summarize your findings and any peculiarities in the data.

3.2 What effect does time have on MPG?

  1. Start with a simple regression of mpg vs. year and report R’s summary output. Is year a significant variable at the .05 level? State what effect year has on mpg, if any, according to this model.

  2. Add horsepower on top of the variable year to your linear model. Is year still a significant variable at the .05 level? Give a precise interpretation of the year’s effect found here.

  3. The two 95% CI’s for the coefficient of year differ between (1) and (2). How would you explain the difference to a non-statistician?

  4. Create a model with interaction by fitting lm(mpg ~ year * horsepower). Is the interaction effect significant at .05 level? Explain the year effect (if any).

3.3 Categorical predictors

Remember that the same variable can play different roles! Take a quick look at the variable cylinders, and try to use this variable in the following analyses wisely. We all agree that a larger number of cylinders will lower mpg. However, we can interpret cylinders as either a continuous (numeric) variable or a categorical variable.

  1. Fit a model that treats cylinders as a continuous/numeric variable. Is cylinders significant at the 0.01 level? What effect does cylinders play in this model?

  2. Fit a model that treats cylinders as a categorical/factor. Is cylinders significant at the .01 level? What is the effect of cylinders in this model? Describe the cylinders effect over mpg.

  3. What are the fundamental differences between treating cylinders as a continuous and categorical variable in your models?

  4. Can you test the null hypothesis: fit0: mpg is linear in cylinders vs. fit1: mpg relates to cylinders as a categorical variable at .01 level?
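The test in part 4 is a nested-model F-test, which `anova()` performs directly. A sketch using the built-in `mtcars` data as a stand-in for `Auto` (same idea: numeric vs. factor cylinders):

```r
# fit0: mpg linear in cylinders; fit1: cylinders as a factor
# (mtcars is a built-in stand-in; the homework would use ISLR::Auto)
fit0 <- lm(mpg ~ cyl, data = mtcars)
fit1 <- lm(mpg ~ factor(cyl), data = mtcars)

# Nested-model F-test: does the factor coding fit significantly better?
anova(fit0, fit1)
```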

3.4 Results

Final modeling question: we want to explore the effects of each feature as best as possible. You may explore interactions, feature transformations, higher order terms, or other strategies within reason. The model(s) should be as parsimonious (simple) as possible unless the gain in accuracy is significant from your point of view.

  1. Describe the final model. Include diagnostic plots with particular focus on the model residuals and diagnoses.

  2. Summarize the effects found.

  3. Predict the mpg of the following car: A red car built in the US in 1983 that is 180 inches long, has eight cylinders, displaces 350 cu. inches, weighs 4000 pounds, and has a horsepower of 260. Also give a 95% CI for your prediction.
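Point predictions with a 95% interval come from `predict(..., interval = "prediction")`. A sketch on the built-in `mtcars` data with a hypothetical new car (not the Auto car described above):

```r
# A toy working model on mtcars; wt is weight in 1000 lbs, hp is horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new car (illustrative values only)
new_car <- data.frame(wt = 4.0, hp = 260)

# fit = point prediction; lwr/upr = bounds of the 95% prediction interval
predict(fit, newdata = new_car, interval = "prediction", level = 0.95)
```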

4 Simple Regression through simulations

4.1 Linear model through simulations

This exercise is designed to help you understand the linear model using simulations. In this exercise, we will generate \((x_i, y_i)\) pairs so that all linear model assumptions are met.

Presume that \(\mathbf{x}\) and \(\mathbf{y}\) are linearly related with a normal error \(\boldsymbol{\varepsilon}\) , such that \(\mathbf{y} = 1 + 1.2\mathbf{x} + \boldsymbol{\varepsilon}\). The standard deviation of the error \(\varepsilon_i\) is \(\sigma = 2\).

We can create a sample input vector (\(n = 40\)) for \(\mathbf{x}\) with the following code:

# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)

4.1.1 Generate data

Create a corresponding output vector for \(\mathbf{y}\) according to the equation given above. Use set.seed(1). Then, create a scatterplot with \((x_i, y_i)\) pairs. Base R plotting is acceptable, but if you can, please attempt to use ggplot2 to create the plot. Make sure to have clear labels and sensible titles on your plots.

Here we generate the data

# Set seed to ensure replicability
set.seed(42)
# Generates a vector of size 40 with equally spaced values between 0 and 1, inclusive
x <- seq(0, 1, length = 40)
# Create y based on the model equation
y <- 1 + 1.2*x + rnorm(40,0,2)

Now we can visualize it

ggplot() +
  geom_point(mapping=aes(x=x,y=y)) +
  labs(title = TeX('Plotting $x$ Values Against $y$ Values Generated from the Model $y = 1 + 1.2x + \\epsilon$')) +
  theme_minimal()

The graph reveals that because x only ranges from 0 to 1, the highly dispersed errors (recall that \(\varepsilon\) has a standard deviation of 2) heavily obscure the relationship between x and y. This does not necessarily pose a problem for estimating the true coefficients of the model, aside from the fact that the estimates will be noisy (i.e. large standard errors).

4.1.2 Understand the model

  1. Find the LS estimates of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\), using the lm() function. What are the true values of \(\boldsymbol{\beta}_0\) and \(\boldsymbol{\beta}_1\)? Do the estimates look to be good?

Here is how we estimate the coefficients

# Fit linear model
modelQ412 <- lm(y ~ 1 + x)
# Display model summary
summary(modelQ412)
## 
## Call:
## lm(formula = y ~ 1 + x)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.397 -0.909 -0.297  1.766  4.163 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)     2.19       0.73    3.01   0.0047 **
## x              -1.35       1.26   -1.07   0.2905   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.35 on 38 degrees of freedom
## Multiple R-squared:  0.0294, Adjusted R-squared:  0.00381 
## F-statistic: 1.15 on 1 and 38 DF,  p-value: 0.29

The estimates are not very good. The true intercept is 1, but the model estimates it as 2.19. Similarly, the true slope for x is 1.2 and the model estimates it as -1.35.

  1. What is your RSE for this linear model fit? Is it close to \(\sigma = 2\)?

The residual standard error (RSE) is 2.35, not too far from the true value of 2.

  1. What is the 95% confidence interval for \(\boldsymbol{\beta}_1\)? Does this confidence interval capture the true \(\boldsymbol{\beta}_1\)?

Because the sample size is rather small (\(N=40\)) we should use the Student t critical values rather than the z values from the Normal distribution.

# Get critical value
critVal <- abs(qt(0.025,38)) # Recall that the Student t distribution is symmetrical 

We can now compute the 95% confidence intervals as:

# Upper bounds
coef(modelQ412) + critVal*sqrt(diag(vcov(modelQ412)))
## (Intercept)           x 
##        3.67        1.20
# Lower bounds
coef(modelQ412) - critVal*sqrt(diag(vcov(modelQ412)))
## (Intercept)           x 
##       0.716      -3.892

Which is precisely the result we would have obtained using the confint function.

confint(modelQ412)
##              2.5 % 97.5 %
## (Intercept)  0.716   3.67
## x           -3.892   1.20
  1. Overlay the LS estimates and the true lines of the mean function onto a copy of the scatterplot you made above.

Here is the required plot.

We can see that the estimated line is rather far from the real one.

4.1.3 Diagnoses

  1. Provide a residual plot where fitted \(\mathbf{y}\)-values are on the x-axis and residuals are on the y-axis.

Here is the residuals plot

There are no clear signs of violation of homoskedasticity (because the errors are indeed homoskedastic).

  1. Provide a normal QQ plot of the residuals.

Here is the Normal QQ plot.

qqnorm(resid(modelQ412))
qqline(resid(modelQ412))

Normally distributed residuals should fall along the line. There is some evidence of a deviation from Normality, especially in the left tail (though we know this is not the case).

  1. Comment on how well the model assumptions are met for the sample you used.

We know that at the population level the assumptions hold, but in our specific sample we could be concerned about non-Normality of the errors. Homoskedasticity seems satisfied even in this sample.

4.2 Understand sampling distribution and confidence intervals

This part aims to help you understand the notion of sampling statistics and confidence intervals. Let’s concentrate on estimating the slope only.

Generate 100 samples of size \(n = 40\), and estimate the slope coefficient from each sample. We include some sample code below, which should guide you in setting up the simulation. Note: this code is easier to follow but suboptimal; see the appendix for a more optimal R-like way to run this simulation.

# Initializing variables. Note se, b1, upper_ci, lower_ci are vectors
x <- seq(0, 1, length = 40) 
n_sim <- 100              # number of simulations
b1 <- 0                   # n_sim many LS estimates of beta_1 (=1.2). Initialize to 0 for now
se <- 0                   # n_sim many standard errors of the slope. Initialize to 0 for now.
upper_ci <- 0             # upper bound for beta_1. Initialize to 0 for now.
lower_ci <- 0             # lower bound for beta_1. Initialize to 0 for now.
t_star <- qt(0.975, 38)   # Food for thought: why 38 instead of 40? What is t_star?

# Perform the simulation
for (i in 1:n_sim){
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  lse <- lm(y ~ x)
  lse_output <- summary(lse)$coefficients
  se[i] <- lse_output[2, 2]
  b1[i] <- lse_output[2, 1]
  upper_ci[i] <- b1[i] + t_star * se[i]
  lower_ci[i] <- b1[i] - t_star * se[i]
}
results <- as.data.frame(cbind(se, b1, upper_ci, lower_ci))

# remove unnecessary variables from our workspace (keep x, which is reused below)
rm(se, b1, upper_ci, lower_ci, n_sim, t_star, lse, lse_output) 

We will try a more compact approach with the code block below. We could have been less verbose, but we preferred a more self-explanatory approach. We make use of the wonderful purrr and broom libraries, part of the tidyverse.

generate_y <- function(x) {
  y <- 1 + 1.2*x + rnorm(40,0,2)
}
  
fit_model <- function(x,y) {
  fittedModel <- lm(y ~ 1 + x)
}

extract_coefficients <- function(fittedModel,thisSimulation) {
  fittedModel %>% 
    broom::tidy() %>%
    mutate(simNum = thisSimulation)
}

clean_coef_table <- function(x,thisSimulation) {
  y <- generate_y(x)
  fittedModel <- fit_model(x,y)
  cleanCoefs <- extract_coefficients(fittedModel,thisSimulation)
}

cleanedCoefficients <- map_dfr(1:100, ~ clean_coef_table(x,.x))
  1. Summarize the LS estimates of \(\boldsymbol{\beta}_1\) (stored in results$b1). Does the sampling distribution agree with theory?

Here is a density plot of the centered estimated slope for x (\(\beta_1\)) compared with the theoretical distribution, a Student t with 38 degrees of freedom. The two distributions are barely distinguishable. Notice that we had to center the sampling distribution at 0 (by subtracting its mean).
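The agreement with theory can also be checked numerically: the slope estimate has standard error sigma / sqrt(sum((x - mean(x))^2)). A self-contained sketch of the comparison (rerunning a small simulation rather than reusing results$b1):

```r
set.seed(42)
x <- seq(0, 1, length = 40)
n_sim <- 1000

# Simulate the slope estimate n_sim times under the true model
b1 <- replicate(n_sim, {
  y <- 1 + 1.2 * x + rnorm(40, sd = 2)
  coef(lm(y ~ x))[["x"]]
})

# Theoretical SE of the slope: sigma / sqrt(sum((x - mean(x))^2))
se_theory <- 2 / sqrt(sum((x - mean(x))^2))

c(empirical = sd(b1), theoretical = se_theory)
```

The empirical standard deviation of the simulated slopes should sit close to the theoretical value, and the simulated slopes should center near the true beta_1 = 1.2.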

  1. How many of your 95% confidence intervals capture the true \(\boldsymbol{\beta}_1\)? Display your confidence intervals graphically.

Let us first build the confidence intervals and add them as new columns to our dataframe.

cleanedCoefficients <- cleanedCoefficients %>%
  mutate(upperBound = estimate + critVal*std.error,
         lowerBound = estimate - critVal*std.error)

Now let’s visualize them

Each horizontal line represents the 95% confidence interval for the slope in one of the samples. The solid vertical line shows the true slope. We can see that most of the confidence intervals contain the true slope. Let us make this statement more precise.

# We compute the proportion of intervals that contain the true value.
propContains <- cleanedCoefficients %>%
  filter(term=='x') %>%
  mutate(containsTrueVal = (lowerBound <= 1.2 & upperBound >= 1.2)/n()) %>%
  pull(containsTrueVal) %>%
  sum()

We find that 96% of the intervals contain the true value, very close to the nominal coverage (95%).